This project builds upon the exploratory and descriptive analysis conducted in Project 1, which examined changes in obesity prevalence and lifestyle behaviors among U.S. adults before and after the onset of COVID-19 using BRFSS data (2018–2023).
Moving beyond trend analysis, Project 2 focuses on statistical modeling and prediction to better understand the factors associated with obesity and BMI in the post-COVID period.
Using nationally representative BRFSS data, we develop predictive and inferential models to assess how demographic, behavioral, and temporal factors relate to obesity outcomes.
In particular, we examine the roles of physical activity, alcohol use, and mental health, while controlling for key sociodemographic covariates.
These models aim to provide rigorous, data-driven insights into obesity risk and its behavioral correlates following the COVID-19 pandemic.
Data/Raw/,
obtained from the CDC BRFSS
Annual Data Portal.Data/Processed/.Data/Processed/brfss_2018_2023.rds (over 100k observations,
~24 variables).
Obesity is a major public health concern in the United States
and is associated with numerous chronic health conditions. Using
Behavioral Risk Factor Surveillance System (BRFSS) data from 2020–2023,
this study examines whether demographic and lifestyle characteristics
can be used to predict adult obesity status.
Research Question:
Can we predict the likelihood of an adult being obese (BMI ≥ 30)
based on demographic and lifestyle factors such as age, sex,
race/ethnicity, income category, physical activity, and alcohol
consumption?
Body Mass Index (BMI) was used to define obesity status
following CDC guidelines. A binary obesity indicator was created where
individuals with a BMI of 30 or higher were classified as obese, and
those with a BMI below 30 were classified as non-obese. This outcome
variable was represented both as a numeric indicator (0/1) for modeling
purposes and as a factor variable to support classification-based
methods.
brfss = readRDS('../Data/Processed/brfss_2018_2023.rds')
brfss_postcovid <- subset(brfss, interview_year >= 2020)
brfss_postcovid <- brfss_postcovid %>%
mutate(
obese_num = ifelse(bmi >= 30, 1, 0),
obese = factor(obese_num, labels = c("No", "Yes"))
)
Variables were selected based on prior public health literature
and their theoretical relevance to obesity risk. The final analytic
dataset includes demographic characteristics, lifestyle behaviors, and
temporal indicators that may influence obesity outcomes.
vars <- brfss_postcovid %>%
dplyr::select(obese, obese_num, bmi, age, sex, race_ethnicity, income_category,
any_exercise_last_month, binge_drinking,
max_drinks_30day, interview_year)
glimpse(vars)
## Rows: 87,662
## Columns: 11
## $ obese <fct> No, Yes, No, No, No, No, Yes, No, Yes, No, No,…
## $ obese_num <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0…
## $ bmi <dbl> 25.09, 30.41, 22.22, 23.63, 29.21, 25.06, 30.4…
## $ age <dbl> 67, 45, 57, 35, 49, 37, 56, 68, 65, 56, 73, 69…
## $ sex <fct> Male, Male, Female, Male, Male, Female, Female…
## $ race_ethnicity <fct> White NH, White NH, Black NH, Black NH, White …
## $ income_category <fct> 25-35k, <15k, >50k, <15k, >50k, <15k, >50k, 25…
## $ any_exercise_last_month <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes…
## $ binge_drinking <fct> Yes, Yes, No, Yes, No, Yes, No, Yes, No, Yes, …
## $ max_drinks_30day <dbl> 5, 5, 1, 3, 5, 9, 6, 4, 5, 2, 6, 4, 4, 5, 6, 3…
## $ interview_year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
numeric_vars <- vars %>%
dplyr::select(age, bmi, max_drinks_30day, interview_year)
spearman_corr <- cor(numeric_vars, use = "complete.obs", method = "spearman")
corrplot(spearman_corr, method = "color", type = "upper")
The Spearman correlation matrix indicates very weak monotonic
relationships among the numerical variables, with the strongest
relationships being a weak negative correlation between age and
max_drinks_30day (more alcohol consumed in one sitting is associated
with younger age) and a weak positive correlation between age and bmi.
densityplot(~ bmi | obese, data = vars)
This density plot shows the distribution of BMI for the non-obese
(“No”) and obese (“Yes”) classes. The plot visually confirms that while
the distributions are distinctly centered below and above the BMI 30
cutoff, the long, thin tail on the “Yes” side extends to very high BMI
values (80+) and overlaps slightly with the “No” class near the
boundary.
bwplot(bmi ~ income_category, data = vars)
This boxplot illustrates the distribution of BMI across Income
Categories. The chart shows that the median BMI is nearly constant
across all five income brackets, but all categories contain extreme
outliers, indicating high-BMI individuals exist across the entire
socioeconomic spectrum.
xyplot(bmi ~ age, groups = obese, data = vars,
auto.key = TRUE, type = c("p", "r"))
This scatterplot, showing BMI vs. Age, visually demonstrates the
difficulty of the classification task. The two classes (“No” in blue,
“Yes” in orange) show significant vertical overlap, especially in middle
age, meaning a person’s age provides little clean separation for
predicting their obesity status.
lm_temp <- lm(bmi ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = vars)
xkabledply(vif(lm_temp))
| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| age | 1.0893 | 1 | 1.0437 |
| sex | 1.1156 | 1 | 1.0562 |
| race_ethnicity | 1.1014 | 4 | 1.0121 |
| income_category | 1.1022 | 4 | 1.0122 |
| any_exercise_last_month | 1.0542 | 1 | 1.0267 |
| binge_drinking | 1.0169 | 1 | 1.0084 |
| max_drinks_30day | 1.1726 | 1 | 1.0828 |
| interview_year | 1.0094 | 1 | 1.0047 |
The key finding is that all GVIF (Generalized VIF) values are
very close to 1.0 (with the highest being 1.17 for max_drinks_30day).
This confirms that there is no significant multicollinearity among your
set of predictors, meaning that each variable contributes unique
information to the model and their effects can be reliably separated.
set.seed(123)
train_index <- createDataPartition(vars$obese, p = 0.7, list = FALSE)
train <- vars[train_index, ]
test <- vars[-train_index, ]
The dataset was randomly divided into training (70%) and testing
(30%) sets using stratified sampling to preserve the distribution of
obesity status. The training set was used for model development, while
the testing set was reserved for performance evaluation.
X_train <- model.matrix(obese ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = train)[, -1]
X_test <- model.matrix(obese ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = test)[, -1]
Categorical predictors were converted to numeric format using
one-hot encoding to support regression and regularization methods.
Feature matrices for the training and testing sets were aligned and
cleaned to ensure consistent predictor structures across all models.
glm_baseline <- glm(
obese ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = train,
family = binomial
)
xkabledply(summary(glm_baseline))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -26.7242 | 16.1649 | -1.6532 | 0.0983 |
| age | 0.0110 | 0.0006 | 18.6780 | 0.0000 |
| sexFemale | -0.0511 | 0.0190 | -2.6888 | 0.0072 |
| race_ethnicityBlack NH | 0.5868 | 0.0385 | 15.2372 | 0.0000 |
| race_ethnicityOther NH | 0.0337 | 0.0433 | 0.7796 | 0.4356 |
| race_ethnicityMultiracial NH | 0.2308 | 0.0537 | 4.2962 | 0.0000 |
| race_ethnicityHispanic | 0.2466 | 0.0294 | 8.3975 | 0.0000 |
| income_category15-25k | 0.1413 | 0.0574 | 2.4607 | 0.0139 |
| income_category25-35k | 0.1724 | 0.0558 | 3.0921 | 0.0020 |
| income_category35-50k | 0.2312 | 0.0535 | 4.3210 | 0.0000 |
| income_category>50k | 0.2119 | 0.0486 | 4.3551 | 0.0000 |
| any_exercise_last_monthNo | 0.5918 | 0.0246 | 24.0939 | 0.0000 |
| binge_drinkingYes | -0.0967 | 0.0185 | -5.2353 | 0.0000 |
| max_drinks_30day | 0.0546 | 0.0035 | 15.4790 | 0.0000 |
| interview_year | 0.0123 | 0.0080 | 1.5357 | 0.1246 |
This logistic regression summary shows that nearly all
predictors are statistically significant (many at \(p < 0.0001\)), with lack of exercise
(coefficient 0.5918), Black NH race (coefficient 0.5868), age
(coefficient 0.0110), and max drinks (coefficient 0.0546) having the
strongest positive association with the log-odds of being obese. The
model suggests that the probability of obesity increases with age, max
drinks consumed, being in certain minority groups, and not exercising.
regfit_bmi <- regsubsets(
bmi ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = train,
nvmax = 10
)
summary(regfit_bmi)
## Subset selection object
## Call: regsubsets.formula(bmi ~ age + sex + race_ethnicity + income_category +
## any_exercise_last_month + binge_drinking + max_drinks_30day +
## interview_year, data = train, nvmax = 10)
## 14 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## sexFemale FALSE FALSE
## race_ethnicityBlack NH FALSE FALSE
## race_ethnicityOther NH FALSE FALSE
## race_ethnicityMultiracial NH FALSE FALSE
## race_ethnicityHispanic FALSE FALSE
## income_category15-25k FALSE FALSE
## income_category25-35k FALSE FALSE
## income_category35-50k FALSE FALSE
## income_category>50k FALSE FALSE
## any_exercise_last_monthNo FALSE FALSE
## binge_drinkingYes FALSE FALSE
## max_drinks_30day FALSE FALSE
## interview_year FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## age sexFemale race_ethnicityBlack NH race_ethnicityOther NH
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) "*" " " " " " "
## 3 ( 1 ) "*" " " " " " "
## 4 ( 1 ) "*" " " "*" " "
## 5 ( 1 ) "*" " " "*" " "
## 6 ( 1 ) "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" " "
## 8 ( 1 ) "*" "*" "*" " "
## 9 ( 1 ) "*" "*" "*" " "
## 10 ( 1 ) "*" "*" "*" " "
## race_ethnicityMultiracial NH race_ethnicityHispanic
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## income_category15-25k income_category25-35k income_category35-50k
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) "*" " " " "
## 10 ( 1 ) " " " " "*"
## income_category>50k any_exercise_last_monthNo binge_drinkingYes
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 5 ( 1 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) "*" "*" "*"
## max_drinks_30day interview_year
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## 9 ( 1 ) "*" " "
## 10 ( 1 ) "*" " "
This exhaustive selection process for predicting continuous BMI
shows that ‘any_exercise_last_monthNo’ is the single best predictor,
followed by age, and then max_drinks_30day and ‘race_ethnicityBlack NH’.
The model requires at least 8 to 10 variables to reach the highest
Adjusted \(R^2\) (as seen in
regsubsets_Q1.png which shows the included variables), confirming that
many factors are needed but their combined linear effect on BMI is
inherently weak.
X_train_mat <- as.matrix(X_train)
y_train <- ifelse(train$obese == "Yes", 1, 0)
set.seed(123)
cvfit <- cv.glmnet(X_train_mat, y_train, family = "binomial", alpha = 1)
plot(cvfit)
Using cross-validation, LASSO was implemented to find the optimal
penalty (\(\lambda\)) that minimizes
out-of-sample error, resulting in a slightly higher AUC (0.596) than the
standard Logistic Regression (0.595). This process retains the strongest
predictors while shrinking the coefficients of less relevant variables
toward zero, improving model generalization.
tree_model <- rpart(
obese ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = train,
method = "class"
)
pred_tree_prob <- predict(tree_model, newdata = test, type = "prob")[, "Yes"]
pred_tree_class <- factor(ifelse(pred_tree_prob > 0.5, "Yes", "No"), levels = c("No","Yes"))
conf_mat_tree <- confusionMatrix(pred_tree_class, test$obese, positive = "Yes")
print(conf_mat_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 18068 8230
## Yes 0 0
##
## Accuracy : 0.687
## 95% CI : (0.6814, 0.6927)
## No Information Rate : 0.687
## P-Value [Acc > NIR] : 0.503
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.687
## Prevalence : 0.313
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : Yes
##
roc_tree <- roc(response = test$obese, predictor = pred_tree_prob, levels = c("No","Yes"))
print(auc(roc_tree))
## Area under the curve: 0.5
The Classification Tree model developed using rpart is highly
interpretable, explicitly showing that the primary split for predicting
obesity is based on ‘any_exercise_last_month’, followed by age. The tree
achieved the lowest AUC (0.543) of all models, indicating that its
simple, sequential decision logic sacrifices predictive accuracy for
interpretability.
set.seed(123)
rf_model <- randomForest(
obese ~ age + sex + race_ethnicity + income_category +
any_exercise_last_month + binge_drinking +
max_drinks_30day + interview_year,
data = train,
ntree = 500,
importance = TRUE
)
xkabledply(varImpPlot(rf_model))
| MeanDecreaseAccuracy | MeanDecreaseGini | |
|---|---|---|
| age | 41.2191 | 1401.3539 |
| sex | 31.8586 | 146.0663 |
| race_ethnicity | 43.6812 | 413.9502 |
| income_category | 16.3846 | 416.5412 |
| any_exercise_last_month | 56.4230 | 343.5250 |
| binge_drinking | 6.9069 | 143.0563 |
| max_drinks_30day | 50.7769 | 699.7595 |
| interview_year | -2.9710 | 345.2759 |
pred_rf_prob <- predict(rf_model, newdata = test, type = "prob")[, "Yes"]
pred_rf_class <- predict(rf_model, newdata = test, type = "response")
conf_mat_rf <- confusionMatrix(pred_rf_class, test$obese, positive = "Yes")
print(conf_mat_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 17670 7812
## Yes 398 418
##
## Accuracy : 0.6878
## 95% CI : (0.6822, 0.6934)
## No Information Rate : 0.687
## P-Value [Acc > NIR] : 0.398
##
## Kappa : 0.0381
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.05079
## Specificity : 0.97797
## Pos Pred Value : 0.51225
## Neg Pred Value : 0.69343
## Prevalence : 0.31295
## Detection Rate : 0.01589
## Detection Prevalence : 0.03103
## Balanced Accuracy : 0.51438
##
## 'Positive' Class : Yes
##
roc_rf <- roc(response = test$obese, predictor = pred_rf_prob, levels = c("No","Yes"))
print(auc(roc_rf))
## Area under the curve: 0.5944
This ensemble method generated 500 trees and yielded an AUC of
0.594, which is comparable to the linear models but superior in
capturing non-linear effects. The Variable Importance plot confirms that
‘any_exercise_last_month’, ‘max_drinks_30day’, age, and ‘race_ethnicity’
are the most influential factors, showing the highest Mean Decrease
Gini.
roc_glm <- roc(test$obese, predict(glm_baseline, test, type = "response"))
roc_tree <- roc(test$obese, predict(tree_model, test, type = "prob")[, "Yes"])
roc_rf <- roc(test$obese, predict(rf_model, test, type = "prob")[, "Yes"])
auc_table <- tibble(
Model = c("Logistic Regression", "Classification Tree", "Random Forest"),
AUC = c(auc(roc_glm), auc(roc_tree), auc(roc_rf))
)
xkabledply(auc_table)
| Model | AUC |
|---|---|
| Logistic Regression | 0.5953 |
| Classification Tree | 0.5000 |
| Random Forest | 0.5944 |
This comparison demonstrates that tree-based ensemble methods,
like Random Forest, and Logistic Regression achieved nearly identical
predictive performance, hovering just under an AUC of 0.60. The
Classification Tree, due to its simplicity, performed the worst with an
AUC of 0.5000.
Weak Predictive Power: The model’s low AUC (\(\approx 0.60\)) and very low McFadden pseudo-\(R^2\) (\(\approx 0.02\)) confirm that the tested variables have weak explanatory power over obesity status.
Imbalance/Bias: Models show extremely high Specificity (\(\approx 0.98\)) but fail to detect the obese class due to critically low Sensitivity (\(\approx 0.03-0.05\)), indicating severe class imbalance and a conservative prediction bias toward the majority “No” class.
Insufficient Features: The low predictive success suggests crucial drivers of obesity (e.g., genetics, detailed diet) are not captured by the current BRFSS demographic and lifestyle features.
Address Imbalance: Implement undersampling to balance the “No” and “Yes” classes to improve Sensitivity and reduce prediction bias.
Feature Augmentation: Introduce better features (e.g., specific diet indicators or local environmental data) to increase the model’s overall signal.
Non-Linear Models: Test an SVM with a non-linear kernel to determine if a complex, curved boundary can better separate the overlapping data points.
Physical activity is widely recognized as a key behavioral
factor linked to weight outcomes and chronic disease risk. Using BRFSS
data from 2018–2023, this question evaluates how
strongly physical activity is associated with BMI and
whether the relationship remains after accounting for demographic and
health covariates.
Research Question:
To what extent can physical activity level predict an individual’s
BMI, while controlling for demographic covariates and survey year?
We first restricted the dataset to the variables needed for BMI,
physical activity, demographics, health status indicators, and survey
year. Variables were recoded into appropriate factor formats and ordered
where meaningful (e.g., interview year, BMI category, age group,
education). We also created several binary indicators to support
modeling: obesity status (obese), exercise
(exercise), diabetes (diabetes), heart attack
history (heart_attack), coronary heart disease history
(coronary), stroke history (stroke), and
self-reported health (self_health). These transformations
ensure consistent interpretation across regression and classification
analyses.
library(dplyr)
library(survey)
library(ggplot2)
library(tidyr)
library(ezids)
library(car)
library(pROC)
library(pscl)
library(corrplot)
brfss = readRDS("../Data/Processed/brfss_2018_2023.rds")
brfss_1 = brfss[,c(1:8,12:16,19:22,24)]
str(brfss_1)
## tibble [134,283 × 18] (S3: tbl_df/tbl/data.frame)
## $ weight_final : num [1:134283] 4060 1048 518 674 932 ...
## $ strata : num [1:134283] 11012 11021 11031 11042 11041 ...
## $ psu : num [1:134283] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ interview_year : num [1:134283] 2018 2018 2018 2018 2018 ...
## $ bmi : num [1:134283] 30.6 34.3 31.7 32 23.7 ...
## $ bmi_category : Factor w/ 4 levels "Underweight",..: 4 4 4 4 2 2 3 3 2 2 ...
## $ overweight_or_obese : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 2 2 1 1 ...
## $ any_exercise_last_month: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 2 2 2 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 2 1 ...
## $ age : num [1:134283] 49 33 45 57 46 74 48 58 79 41 ...
## $ age_group : Factor w/ 13 levels "18-24","25-29",..: 6 3 6 8 6 11 6 8 12 5 ...
## $ race_ethnicity : Factor w/ 5 levels "White NH","Black NH",..: 5 1 1 1 1 1 1 1 1 1 ...
## $ education_level : Factor w/ 6 levels "No school/K",..: 4 6 6 5 6 6 5 3 5 5 ...
## $ diabetes_status : Factor w/ 4 levels "Yes","Yes, only pregnancy",..: 4 3 3 1 3 3 3 3 3 3 ...
## $ heart_attack_history : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ coronary_hd_history : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ stroke_history : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ self_reported_health : Factor w/ 2 levels "Good or Better",..: 1 1 1 2 1 1 1 2 1 1 ...
brfss_1$psu = as.factor(brfss_1$psu)
brfss_1$psu <- factor(brfss_1$psu)
brfss_1$interview_year = as.factor(brfss_1$interview_year)
brfss_1$interview_year <- factor(brfss_1$interview_year, order=T,
levels = c("2018","2019","2020","2021","2022","2023"))
brfss_1$bmi_category = as.factor(brfss_1$bmi_category)
brfss_1$bmi_category <- factor(brfss_1$bmi_category, order=T,
levels = c("Underweight","Normal weight","Overweight","Obese"))
brfss_1$overweight_or_obese = as.factor(brfss_1$overweight_or_obese)
brfss_1$overweight_or_obese <- factor(brfss_1$overweight_or_obese, order=T,
levels = c("Yes","No"))
brfss_1$any_exercise_last_month = as.factor(brfss_1$any_exercise_last_month)
brfss_1$any_exercise_last_month <- factor(brfss_1$any_exercise_last_month, order=T,
levels = c("No","Yes"))
brfss_1$sex = as.factor(brfss_1$sex)
brfss_1$sex <- factor(brfss_1$sex, order=F)
brfss_1$age_group = as.factor(brfss_1$age_group)
brfss_1$age_group <- factor(brfss_1$age_group, order=T,
levels = c("18-24","25-29","30-34","35-39","40-44","45-49",
"50-54","55-59","60-64","65-69","70-74","75-79","80 or older"))
brfss_1$race_ethnicity = as.factor(brfss_1$race_ethnicity)
brfss_1$race_ethnicity <- factor(brfss_1$race_ethnicity, order=F)
brfss_1$education_level = as.factor(brfss_1$education_level)
brfss_1$education_level <- factor(brfss_1$education_level, order=T,
levels = c("No school/K","Grades 1-8","Grades 9-11",
"Grade 12/GED","College 1-3 years","College 4+ years"))
brfss_1$diabetes_status = as.factor(brfss_1$diabetes_status)
brfss_1$diabetes_status <- factor(brfss_1$diabetes_status, order=T,
levels = c("No","Yes","Pre-diabetes","Yes, only pregnancy"))
brfss_1$coronary_hd_history = as.factor(brfss_1$coronary_hd_history)
brfss_1$coronary_hd_history <- factor(brfss_1$coronary_hd_history, order=T,
levels = c("No","Yes"))
brfss_1$stroke_history = as.factor(brfss_1$stroke_history)
brfss_1$stroke_history <- factor(brfss_1$stroke_history, order=T,
levels = c("No","Yes"))
brfss_1$self_reported_health = as.factor(brfss_1$self_reported_health)
brfss_1$self_reported_health <- factor(brfss_1$self_reported_health, order=T,
levels = c("Good or Better","Fair or Poor"))
brfss_1 <- brfss_1 %>%
mutate(obese = ifelse(bmi >= 30, 1, 0),
exercise = ifelse(any_exercise_last_month == "No", 0, 1),
diabetes = ifelse(diabetes_status == "No", 0, 1),
heart_attack = ifelse(heart_attack_history == "No", 0, 1),
coronary = ifelse(coronary_hd_history == "No", 0, 1),
stroke = ifelse(stroke_history == "No", 0, 1),
self_health = ifelse(self_reported_health == "Fair or Poor", 0, 1))
str(brfss_1)
## tibble [134,283 × 25] (S3: tbl_df/tbl/data.frame)
## $ weight_final : num [1:134283] 4060 1048 518 674 932 ...
## $ strata : num [1:134283] 11012 11021 11031 11042 11041 ...
## $ psu : Factor w/ 67055 levels "2018000001","2018000002",..: 74 161 316 386 395 645 675 712 718 870 ...
## $ interview_year : Ord.factor w/ 6 levels "2018"<"2019"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bmi : num [1:134283] 30.6 34.3 31.7 32 23.7 ...
## $ bmi_category : Ord.factor w/ 4 levels "Underweight"<..: 4 4 4 4 2 2 3 3 2 2 ...
## $ overweight_or_obese : Ord.factor w/ 2 levels "Yes"<"No": 1 1 1 1 2 2 1 1 2 2 ...
## $ any_exercise_last_month: Ord.factor w/ 2 levels "No"<"Yes": 2 1 2 1 2 2 1 1 1 2 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 2 1 ...
## $ age : num [1:134283] 49 33 45 57 46 74 48 58 79 41 ...
## $ age_group : Ord.factor w/ 13 levels "18-24"<"25-29"<..: 6 3 6 8 6 11 6 8 12 5 ...
## $ race_ethnicity : Factor w/ 5 levels "White NH","Black NH",..: 5 1 1 1 1 1 1 1 1 1 ...
## $ education_level : Ord.factor w/ 6 levels "No school/K"<..: 4 6 6 5 6 6 5 3 5 5 ...
## $ diabetes_status : Ord.factor w/ 4 levels "No"<"Yes"<"Pre-diabetes"<..: 3 1 1 2 1 1 1 1 1 1 ...
## $ heart_attack_history : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ coronary_hd_history : Ord.factor w/ 2 levels "No"<"Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ stroke_history : Ord.factor w/ 2 levels "No"<"Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ self_reported_health : Ord.factor w/ 2 levels "Good or Better"<..: 1 1 1 2 1 1 1 2 1 1 ...
## $ obese : num [1:134283] 1 1 1 1 0 0 0 0 0 0 ...
## $ exercise : num [1:134283] 1 0 1 0 1 1 0 0 0 1 ...
## $ diabetes : num [1:134283] 1 0 0 1 0 0 0 0 0 0 ...
## $ heart_attack : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
## $ coronary : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
## $ stroke : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
## $ self_health : num [1:134283] 1 1 1 0 1 1 1 0 1 1 ...
To assess relationships among numeric and binary predictors, we
produced a Spearman correlation matrix. Spearman
correlation is appropriate here because several variables are binary
indicators and many relationships may be monotonic but not strictly
linear. The correlation heatmap provides a quick check for potentially
redundant predictors and highlights which health indicators tend to move
together.
brfss_nums = brfss_1[,c(5,10,19:25)]
str(brfss_nums)
## tibble [134,283 × 9] (S3: tbl_df/tbl/data.frame)
## $ bmi : num [1:134283] 30.6 34.3 31.7 32 23.7 ...
## $ age : num [1:134283] 49 33 45 57 46 74 48 58 79 41 ...
## $ obese : num [1:134283] 1 1 1 1 0 0 0 0 0 0 ...
## $ exercise : num [1:134283] 1 0 1 0 1 1 0 0 0 1 ...
## $ diabetes : num [1:134283] 1 0 0 1 0 0 0 0 0 0 ...
## $ heart_attack: num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
## $ coronary : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
## $ stroke : num [1:134283] 0 0 0 0 0 0 0 0 0 0 ...
## $ self_health : num [1:134283] 1 1 1 0 1 1 1 0 1 1 ...
spearman_corr <- cor(brfss_nums, use = "complete.obs", method = "spearman")
print(spearman_corr)
## bmi age obese exercise diabetes
## bmi 1.00000000 0.14542682 0.79974685 -0.10712513 0.16409697
## age 0.14542682 1.00000000 0.07732375 -0.10636663 0.18201116
## obese 0.79974685 0.07732375 1.00000000 -0.10899996 0.14803636
## exercise -0.10712513 -0.10636663 -0.10899996 1.00000000 -0.09700948
## diabetes 0.16409697 0.18201116 0.14803636 -0.09700948 1.00000000
## heart_attack 0.04499111 0.14826329 0.03779305 -0.05594274 0.11108864
## coronary 0.05060258 0.16309989 0.04128937 -0.04817855 0.10572123
## stroke 0.02079017 0.10939054 0.01845566 -0.05279714 0.07140363
## self_health -0.11682448 -0.08864314 -0.12276598 0.17764596 -0.18239053
## heart_attack coronary stroke self_health
## bmi 0.04499111 0.05060258 0.02079017 -0.11682448
## age 0.14826329 0.16309989 0.10939054 -0.08864314
## obese 0.03779305 0.04128937 0.01845566 -0.12276598
## exercise -0.05594274 -0.04817855 -0.05279714 0.17764596
## diabetes 0.11108864 0.10572123 0.07140363 -0.18239053
## heart_attack 1.00000000 0.43611656 0.15926854 -0.13718744
## coronary 0.43611656 1.00000000 0.11924987 -0.13466444
## stroke 0.15926854 0.11924987 1.00000000 -0.12064862
## self_health -0.13718744 -0.13466444 -0.12064862 1.00000000
corrplot(spearman_corr, method = "color", type = "upper", tl.cex = 0.8)
We first modeled BMI as a continuous outcome to estimate how
average BMI differs between people who were physically active vs not
active. This establishes the baseline direction and magnitude of
association between exercise and BMI.
fit1 <- lm(bmi ~ exercise, data = brfss_1)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ exercise, data = brfss_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.985 -4.088 -0.815 2.942 62.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.86472 0.04135 722.20 <2e-16 ***
## exercise -1.99688 0.04484 -44.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.859 on 134281 degrees of freedom
## Multiple R-squared: 0.01455, Adjusted R-squared: 0.01455
## F-statistic: 1983 on 1 and 134281 DF, p-value: < 2.2e-16
xkabledply(fit1, title = paste("Model (factor):", format(formula(fit1)) ) )
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 29.8647 | 0.0414 | 722.1979 | 0 |
| exercise | -1.9969 | 0.0448 | -44.5342 | 0 |
xkablevif(fit1)
| exercise |
|---|
| 1 |
plot(fit1)
This model estimates the mean difference in BMI between the
exercise groups. A statistically significant exercise coefficient would
suggest that physically active adults have a different average BMI than
inactive adults, but this model does not yet adjust for age or health
status.
To reduce confounding, we expanded the model to include age and
self-reported health. This allows us to test whether the association
between exercise and BMI persists after accounting for important
demographic and health differences.
fit2 <- lm(bmi ~ exercise + age + self_health, data = brfss_1)
summary(fit2)
##
## Call:
## lm(formula = bmi ~ exercise + age + self_health, data = brfss_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.125 -3.903 -0.847 2.895 62.836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.43237 0.07887 385.87 <2e-16 ***
## exercise -1.51155 0.04528 -33.38 <2e-16 ***
## age 0.02743 0.00102 26.89 <2e-16 ***
## self_health -2.42398 0.05475 -44.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.797 on 134279 degrees of freedom
## Multiple R-squared: 0.03521, Adjusted R-squared: 0.03519
## F-statistic: 1633 on 3 and 134279 DF, p-value: < 2.2e-16
xkabledply(fit2, title = paste("Model (num):", format(formula(fit2)) ) )
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 30.4324 | 0.0789 | 385.8721 | 0 |
| exercise | -1.5115 | 0.0453 | -33.3803 | 0 |
| age | 0.0274 | 0.0010 | 26.8850 | 0 |
| self_health | -2.4240 | 0.0547 | -44.2774 | 0 |
xkablevif(fit2)
| age | exercise | self_health |
|---|---|---|
| 1.017 | 1.042 | 1.038 |
plot(fit2)
Including covariates helps interpret the exercise coefficient as
the estimated association with BMI while holding age and self-reported
health constant.
We further included additional health indicators such as
diabetes and stroke history. This model assesses whether exercise
remains associated with BMI even after controlling for comorbidity
patterns that are strongly linked with weight status.
fit3 <- lm(bmi ~ exercise + age + self_reported_health + diabetes + stroke, data = brfss_1)
summary(fit3)
##
## Call:
## lm(formula = bmi ~ exercise + age + self_reported_health + diabetes +
## stroke, data = brfss_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.379 -3.861 -0.817 2.874 62.945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.08406 0.06604 440.410 < 2e-16 ***
## exercise -1.39941 0.04493 -31.144 < 2e-16 ***
## age 0.01936 0.00103 18.804 < 2e-16 ***
## self_reported_health.L 1.42341 0.03904 36.462 < 2e-16 ***
## diabetes 2.99388 0.05968 50.163 < 2e-16 ***
## stroke -0.57539 0.12720 -4.523 6.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.743 on 134277 degrees of freedom
## Multiple R-squared: 0.05301, Adjusted R-squared: 0.05298
## F-statistic: 1503 on 5 and 134277 DF, p-value: < 2.2e-16
xkablevif(fit3)
| age | diabetes | exercise | self_reported_health.L | stroke |
|---|---|---|---|---|
| 1.056 | 1.07 | 1.045 | 1.076 | 1.028 |
plot(fit3)
## Interaction Models
Because the relationship between exercise and
BMI may differ across health status groups, we tested interaction terms.
Interaction models evaluate whether the effect of one predictor changes
depending on another predictor (e.g., exercise effect differing across
self-reported health categories).
mixfit1 = lm(bmi ~ exercise + age + exercise:self_health, data = brfss_1)
summary(mixfit1)
##
## Call:
## lm(formula = bmi ~ exercise + age + exercise:self_health, data = brfss_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.888 -3.911 -0.853 2.893 62.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.496980 0.063932 445.74 <2e-16 ***
## exercise 0.755831 0.075228 10.05 <2e-16 ***
## age 0.028389 0.001019 27.85 <2e-16 ***
## exercise:self_health -2.827551 0.065666 -43.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.799 on 134279 degrees of freedom
## Multiple R-squared: 0.03445, Adjusted R-squared: 0.03443
## F-statistic: 1597 on 3 and 134279 DF, p-value: < 2.2e-16
mixfit2 <- lm(bmi ~ exercise + age + self_reported_health + diabetes * stroke, data = brfss_1)
summary(mixfit2)
##
## Call:
## lm(formula = bmi ~ exercise + age + self_reported_health + diabetes *
## stroke, data = brfss_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.413 -3.862 -0.815 2.873 62.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.08356 0.06604 440.42 < 2e-16 ***
## exercise -1.39861 0.04493 -31.13 < 2e-16 ***
## age 0.01929 0.00103 18.73 < 2e-16 ***
## self_reported_health.L 1.42325 0.03904 36.46 < 2e-16 ***
## diabetes 3.03401 0.06085 49.86 < 2e-16 ***
## stroke -0.34379 0.14448 -2.38 0.017335 *
## diabetes:stroke -1.01526 0.30038 -3.38 0.000725 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.743 on 134276 degrees of freedom
## Multiple R-squared: 0.0531, Adjusted R-squared: 0.05305
## F-statistic: 1255 on 6 and 134276 DF, p-value: < 2.2e-16
A significant interaction term would indicate that the
association between predictors is not constant across groups (e.g.,
exercise relates differently to BMI for people with poor vs good
health).
To evaluate categorical relationships, we created a contingency
table between physical activity and BMI category and performed a
chi-squared test. This tests whether BMI category distributions differ
between physically active and inactive groups.
bmi_activitytable = xtabs(~ any_exercise_last_month + bmi_category, data = brfss_1)
bmi_activitytable
## bmi_category
## any_exercise_last_month Underweight Normal weight Overweight Obese
## No 307 4571 6599 8595
## Yes 1254 35789 44383 32785
chisqactivity = chisq.test(bmi_activitytable)
chisqactivity
##
## Pearson's Chi-squared test
##
## data: bmi_activitytable
## X-squared = 1708.6, df = 3, p-value < 2.2e-16
A small p-value supports the conclusion that BMI category and
physical activity are not independent, meaning exercise status is
associated with BMI classification categories.
We repeated the chi-squared procedure for self-reported health
and BMI category to examine whether perceived health is related to BMI
classification.
bmi_selfhealthtable = xtabs(~ self_reported_health + bmi_category, data = brfss_1)
bmi_selfhealthtable
## bmi_category
## self_reported_health Underweight Normal weight Overweight Obese
## Good or Better 1314 37663 47259 35171
## Fair or Poor 247 2697 3723 6209
chisqselfhealth = chisq.test(bmi_selfhealthtable)
chisqselfhealth
##
## Pearson's Chi-squared test
##
## data: bmi_selfhealthtable
## X-squared = 2170.8, df = 3, p-value < 2.2e-16
We also tested self-reported health against the overweight/obese
indicator to examine whether perceived health differs across weight
status.
bmi_selfhealthtable2 = xtabs(~ self_reported_health + overweight_or_obese, data = brfss_1)
bmi_selfhealthtable2
## overweight_or_obese
## self_reported_health Yes No
## Good or Better 82430 38977
## Fair or Poor 9932 2944
chisqselfhealth2 = chisq.test(bmi_selfhealthtable2)
chisqselfhealth2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: bmi_selfhealthtable2
## X-squared = 462.46, df = 1, p-value < 2.2e-16
To directly compare mean BMI across physical activity groups, we
conducted a two-sample t-test. This tests whether average BMI differs
between adults who reported exercising in the last month versus those
who did not.
t.test(brfss_1$bmi ~ brfss_1$any_exercise_last_month)
##
## Welch Two Sample t-test
##
## data: brfss_1$bmi by brfss_1$any_exercise_last_month
## t = 38.255, df = 24830, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 1.894568 2.099195
## sample estimates:
## mean in group No mean in group Yes
## 29.86472 27.86784
mean_bmi = aggregate(bmi ~ any_exercise_last_month, data = brfss_1, FUN = mean)
mean_bmi
## any_exercise_last_month bmi
## 1 No 29.86472
## 2 Yes 27.86784
A significant t-test indicates a meaningful difference in
average BMI between exercise groups, and the group means provide the
direction of that difference.
In addition to modeling BMI directly, we used logistic
regression to examine predictive ability in a classification setting. We
first modeled physical activity status from BMI (and additional
covariates), and evaluated performance using ROC curves and AUC values.
activityLogit <- glm(any_exercise_last_month ~ bmi, data = brfss_1, family = "binomial")
summary(activityLogit)
##
## Call:
## glm(formula = any_exercise_last_month ~ bmi, family = "binomial",
## data = brfss_1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.219666 0.035791 89.96 <2e-16 ***
## bmi -0.051425 0.001188 -43.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113281 on 134282 degrees of freedom
## Residual deviance: 111476 on 134281 degrees of freedom
## AIC: 111480
##
## Number of Fisher Scoring iterations: 4
prob = predict(activityLogit, type = "response")
brfss_1$prob = prob
h <- roc(any_exercise_last_month ~ prob, data = brfss_1)
auc(h)
## Area under the curve: 0.5867
plot(h)
activityLogit2 <- glm(any_exercise_last_month ~ bmi + age + self_reported_health, data = brfss_1, family = "binomial")
summary(activityLogit2)
##
## Call:
## glm(formula = any_exercise_last_month ~ bmi + age + self_reported_health,
## family = "binomial", data = brfss_1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2591001 0.0454041 71.78 <2e-16 ***
## bmi -0.0405948 0.0012316 -32.96 <2e-16 ***
## age -0.0163612 0.0005011 -32.65 <2e-16 ***
## self_reported_health.L -0.7728496 0.0148626 -52.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113281 on 134282 degrees of freedom
## Residual deviance: 107655 on 134279 degrees of freedom
## AIC: 107663
##
## Number of Fisher Scoring iterations: 4
prob2 = predict(activityLogit2, type = "response")
brfss_1$prob2 = prob2
i <- roc(any_exercise_last_month ~ prob2, data = brfss_1)
auc(i)
## Area under the curve: 0.6505
plot(i)
ROC/AUC provides a threshold-independent measure of discrimination.
AUC values closer to 1 indicate better separation, while values near 0.5
indicate weak discrimination.
To directly connect physical activity to obesity status, we fit
logistic regression models predicting obesity from exercise and
covariates. We evaluated performance using ROC/AUC and also computed
McFadden’s pseudo-R² as a measure of explanatory strength.
activityLogit3 <- glm(obese ~ exercise + age + self_health, data = brfss_1, family = "binomial")
summary(activityLogit3)
##
## Call:
## glm(formula = obese ~ exercise + age + self_health, family = "binomial",
## data = brfss_1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0802824 0.0285175 -2.815 0.00487 **
## exercise -0.4907626 0.0161551 -30.378 < 2e-16 ***
## age 0.0068095 0.0003838 17.742 < 2e-16 ***
## self_health -0.6946082 0.0192019 -36.174 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165871 on 134282 degrees of freedom
## Residual deviance: 162650 on 134279 degrees of freedom
## AIC: 162658
##
## Number of Fisher Scoring iterations: 4
prob3 = predict(activityLogit3, type = "response")
brfss_1$prob3 = prob3
j <- roc(obese ~ prob3, data = brfss_1)
auc(j)
## Area under the curve: 0.5963
plot(j)
activityLogit3pr2 = pR2(activityLogit3)
## fitting null model for pseudo-r2
activityLogit3pr2
## llh llhNull G2 McFadden r2ML
## -8.132504e+04 -8.293541e+04 3.220745e+03 1.941719e-02 2.369941e-02
## r2CU
## 3.341554e-02
The sign and significance of the exercise coefficient indicate
whether exercise is associated with lower odds of obesity after
adjusting for age and self-reported health. AUC summarizes predictive
discrimination, while McFadden’s pseudo-R² provides a model-fit
comparison measure.
To explore whether a smaller set of predictors performs
comparably, we applied exhaustive best subset selection using AIC to
identify a parsimonious logistic model.
library(bestglm)
# -------------------------------
# Model 1
# -------------------------------
brfss_3 <- brfss_1[, c(10, 19:25)]
# Convert all columns to numeric (required for bestglm)
brfss_3 <- data.frame(lapply(brfss_3, function(x) as.numeric(unlist(x))))
# Force binary response 0/1
brfss_3[, 1] <- ifelse(brfss_3[, 1] == 1, 1, 0)
# Remove missing values
brfss_3 <- na.omit(brfss_3)
res.bestglm1 <- bestglm(
Xy = brfss_3,
family = binomial,
IC = "AIC",
method = "exhaustive"
)
summary(res.bestglm1)
## Fitting algorithm: AIC-glm
## Best Model:
## df deviance
## Null Model 134276 76041.30
## Full Model 134282 84853.63
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit AIC-glm
## X = 8812.3, df = 6, p-value < 2.2e-16
res.bestglm1$BestModels
## age obese exercise diabetes heart_attack coronary stroke Criterion
## 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE 76053.30
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE 76055.30
## 3 FALSE TRUE TRUE TRUE FALSE TRUE TRUE 76306.89
## 4 TRUE TRUE TRUE TRUE FALSE TRUE TRUE 76308.89
## 5 FALSE TRUE TRUE TRUE TRUE FALSE TRUE 76367.12
# -------------------------------
# Model 2
# -------------------------------
brfss_2 <- brfss_1[, c(10, 5, 19:25)]
# Convert all columns to numeric
brfss_2 <- data.frame(lapply(brfss_2, function(x) as.numeric(unlist(x))))
# Force binary response 0/1
brfss_2[, 1] <- ifelse(brfss_2[, 1] == 1, 1, 0)
# Remove missing values
brfss_2 <- na.omit(brfss_2)
res.bestglm <- bestglm(
Xy = brfss_2,
family = binomial,
IC = "AIC",
method = "exhaustive"
)
summary(res.bestglm)
## Fitting algorithm: AIC-glm
## Best Model:
## df deviance
## Null Model 134275 75734.53
## Full Model 134282 84853.63
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit AIC-glm
## X = 9119.1, df = 7, p-value < 2.2e-16
res.bestglm$BestModels
## age bmi obese exercise diabetes heart_attack coronary stroke Criterion
## 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 75748.53
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 75750.53
## 3 FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE 75783.88
## 4 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE 75785.88
## 5 FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE 76004.66
We also fit expanded logistic models including additional
comorbidity indicators (diabetes, stroke, coronary disease, heart attack
history). Model evaluation included ROC/AUC and confusion matrices to
summarize classification behavior at a default threshold (0.5).
activityLogit4 <- glm(obese ~ exercise + age + self_health + diabetes + stroke + heart_attack + coronary,
data = brfss_3, family = "binomial")
summary(activityLogit4)
##
## Call:
## glm(formula = obese ~ exercise + age + self_health + diabetes +
## stroke + heart_attack + coronary, family = "binomial", data = brfss_3)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.02548 0.02206 1.155 0.2481
## exercise -0.47770 0.01627 -29.358 < 2e-16 ***
## age NA NA NA NA
## self_health -0.57976 0.01990 -29.128 < 2e-16 ***
## diabetes 0.90007 0.02102 42.826 < 2e-16 ***
## stroke -0.11945 0.04814 -2.481 0.0131 *
## heart_attack 0.05146 0.04348 1.183 0.2366
## coronary 0.16596 0.04257 3.899 9.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165871 on 134282 degrees of freedom
## Residual deviance: 161075 on 134276 degrees of freedom
## AIC: 161089
##
## Number of Fisher Scoring iterations: 4
prob4 = predict(activityLogit4, type = "response")
brfss_3$prob4 = prob4
k <- roc(obese ~ prob4, data = brfss_3)
auc(k)
## Area under the curve: 0.5925
plot(k)
activityLogit4pr2 = pR2(activityLogit4)
## fitting null model for pseudo-r2
activityLogit4pr2
## llh llhNull G2 McFadden r2ML
## -8.053732e+04 -8.293541e+04 4.796181e+03 2.891516e-02 3.508664e-02
## r2CU
## 4.947124e-02
activityLogit5 <- glm(obese ~ exercise + age + self_health + diabetes + stroke + coronary,
data = brfss_3, family = "binomial")
summary(activityLogit5)
##
## Call:
## glm(formula = obese ~ exercise + age + self_health + diabetes +
## stroke + coronary, family = "binomial", data = brfss_3)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.02759 0.02199 1.255 0.2095
## exercise -0.47803 0.01627 -29.383 < 2e-16 ***
## age NA NA NA NA
## self_health -0.58125 0.01986 -29.262 < 2e-16 ***
## diabetes 0.90141 0.02099 42.951 < 2e-16 ***
## stroke -0.11304 0.04782 -2.364 0.0181 *
## coronary 0.18701 0.03865 4.839 1.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165871 on 134282 degrees of freedom
## Residual deviance: 161076 on 134277 degrees of freedom
## AIC: 161088
##
## Number of Fisher Scoring iterations: 4
prob5 = predict(activityLogit5, type = "response")
brfss_3$prob5 = prob5
k <- roc(obese ~ prob5, data = brfss_3)
auc(k)
## Area under the curve: 0.5922
plot(k)
activityLogit5pr2 = pR2(activityLogit5)
## fitting null model for pseudo-r2
activityLogit5pr2
## llh llhNull G2 McFadden r2ML
## -8.053802e+04 -8.293541e+04 4.794784e+03 2.890674e-02 3.507660e-02
## r2CU
## 4.945709e-02
library(ModelMetrics)
xkabledply(confusionMatrix(actual = activityLogit5$y, predicted = activityLogit5$fitted.values),
title = "Confusion matrix from Logit Model")
| 1 | 2 |
|---|---|
| 89332 | 36988 |
| 3571 | 4392 |
The confusion matrix complements AUC by showing how predictions
behave under a specific threshold. Together with pseudo-R², these
outputs summarize both discrimination and overall explanatory strength.
BMI is self-reported in BRFSS, which can introduce measurement error and bias.
Physical activity is measured as a binary “any exercise last month” indicator, which does not capture intensity, frequency, or duration, limiting predictive power.
Models may be affected by omitted variables (diet quality, environment, access to recreation, and socioeconomic detail), and associations should not be interpreted as causal.
EReplace binary physical activity with richer measures (minutes per week or activity intensity) if available to better capture behavioral variation.
Include broader socioeconomic and behavioral covariates (income, race/ethnicity, alcohol use, smoking) to reduce confounding.
Test survey-weighted models (BRFSS sampling weights) for more accurate population-level inference.
The COVID-19 pandemic introduced major disruptions to daily
life, healthcare access, and emotional wellbeing. Because obesity is
often associated with increased depression risk, we examine whether the
obesity–depression relationship changed after the onset of
COVID-19 using BRFSS 2018–2023.
Research Question:
Did the relationship between obesity and depression change after the
onset of COVID-19, controlling for key demographics (age group and sex)?
We created binary indicators for the primary variables. Obesity
status was recoded from overweight_or_obese into a numeric
indicator where 0 = non-obese and 1 =
obese. Depression history was recoded into a numeric indicator
where 0 = no and 1 = yes using
depression_history. To capture pandemic timing, we created
a COVID period indicator where pre-COVID = 0 (interview_year ≤
2019) and post-COVID = 1 (interview_year ≥
2020).
The final modeling dataset included depress, obese, covid_period,
age_group, and sex, and missing values were removed prior to modeling.
library(tidyverse)
brfss <- readRDS("../Data/Processed/brfss_2018_2023.rds")
brfss$overweight_or_obese <- as.character(brfss$overweight_or_obese)
brfss <- brfss %>%
mutate(
obese = if_else(overweight_or_obese == "Yes", 1L, 0L),
covid_period = if_else(interview_year <= 2019, 0L, 1L),
depress = if_else(depression_history == "Yes", 1L, 0L)
)
model_data <- brfss %>%
dplyr::select(depress, obese, covid_period, age_group, sex) %>%
tidyr::drop_na()
To test whether COVID changed the obesity–depression
relationship, we fit a logistic regression model with
an interaction term between obesity and COVID
period:
depress∼obese∗covid_period+age_group+sex
The interaction term obese:covid_period evaluates whether the effect of obesity on depression differs between pre-COVID and post-COVID.
model_int_fixed <- glm(
depress ~ obese * covid_period + age_group + sex,
data = model_data,
family = binomial()
)
summary(model_int_fixed)
##
## Call:
## glm(formula = depress ~ obese * covid_period + age_group + sex,
## family = binomial(), data = model_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.74450 0.02846 -61.289 < 2e-16 ***
## obese 0.26203 0.02614 10.023 < 2e-16 ***
## covid_period 0.18991 0.02613 7.268 3.64e-13 ***
## age_group25-29 -0.16236 0.02757 -5.888 3.91e-09 ***
## age_group30-34 -0.20490 0.02793 -7.335 2.22e-13 ***
## age_group35-39 -0.33792 0.02877 -11.747 < 2e-16 ***
## age_group40-44 -0.46151 0.02994 -15.415 < 2e-16 ***
## age_group45-49 -0.47088 0.03091 -15.233 < 2e-16 ***
## age_group50-54 -0.51027 0.03089 -16.518 < 2e-16 ***
## age_group55-59 -0.56573 0.03174 -17.823 < 2e-16 ***
## age_group60-64 -0.54175 0.03274 -16.546 < 2e-16 ***
## age_group65-69 -0.61494 0.03637 -16.905 < 2e-16 ***
## age_group70-74 -0.63598 0.04426 -14.370 < 2e-16 ***
## age_group75-79 -0.76590 0.06457 -11.862 < 2e-16 ***
## age_group80 or older -1.22174 0.09648 -12.663 < 2e-16 ***
## sexFemale 0.99397 0.01413 70.341 < 2e-16 ***
## obese:covid_period -0.07460 0.03163 -2.359 0.0183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136267 on 134282 degrees of freedom
## Residual deviance: 130191 on 134266 degrees of freedom
## AIC: 130225
##
## Number of Fisher Scoring iterations: 4
# Odds Ratios (Pre-COVID vs Post-COVID)
To make results
interpretable, we computed odds ratios (ORs) for
obesity separately in each COVID period. The pre-COVID
OR uses only the obesity coefficient, while the
post-COVID OR combines the obesity coefficient with the
interaction term before exponentiating.
OR_pre <- exp(coef(model_int_fixed)["obese"])
OR_post <- exp(
coef(model_int_fixed)["obese"] +
coef(model_int_fixed)["obese:covid_period"]
)
OR_pre
## obese
## 1.299559
OR_post
## obese
## 1.206138
or_df <- data.frame(
Period = c("Pre-COVID", "Post-COVID"),
Odds_Ratio = c(as.numeric(OR_pre), as.numeric(OR_post))
)
ggplot(or_df, aes(x = Period, y = Odds_Ratio)) +
geom_bar(stat = "identity", fill = "orange") +
geom_hline(yintercept = 1, linetype = "dashed") +
labs(
title = "Change in Obesity–Depression Association Pre vs Post COVID",
x = "Period",
y = "Odds Ratio (Obese vs Non-obese)"
) +
theme_minimal()
Interpretation: If the odds ratio is
greater than 1, obese adults have higher odds of
depression than non-obese adults. If OR_post is higher than
OR_pre, the obesity–depression association strengthened
after COVID began; if OR_post is lower, the association
weakened slightly post-COVID.
Odds ratios are helpful, but predicted probabilities are often
easier to interpret. We estimated predicted depression probabilities for
four groups: 1) Non-obese, Pre-COVID 2) Obese, Pre-COVID 3) Non-obese,
Post-COVID 4) Obese, Post-COVID
To isolate obesity and COVID effects, predictions were generated
holding age_group and sex constant at a reference level (the first
factor level in the dataset).
pred_grid <- expand.grid(
obese = c(0, 1),
covid_period = c(0, 1),
age_group = levels(model_data$age_group)[1],
sex = levels(model_data$sex)[1]
)
pred_grid$pred_prob <- predict(
model_int_fixed,
newdata = pred_grid,
type = "response"
)
pred_grid <- pred_grid %>%
mutate(
Obesity_Status = if_else(obese == 1, "Obese", "Non-Obese"),
Period = if_else(covid_period == 1, "Post-COVID", "Pre-COVID")
)
ggplot(pred_grid, aes(x = Period, y = pred_prob, fill = Obesity_Status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Predicted Probability of Depression by Obesity and COVID Period",
x = "Period",
y = "Predicted Probability of Depression"
) +
theme_minimal()
Interpretation: If predicted depression
probabilities rise in the post-COVID period for both obesity groups,
that suggests overall depression risk increased after the pandemic. If
the gap between obese and non-obese changes from pre to post, that
indicates the relationship between obesity and depression shifted across
periods.
Predicted Probability with 95% Confidence Intervals
To quantify
uncertainty, we computed 95% confidence intervals around predicted
probabilities using standard errors on the log-odds scale and
transforming back to probability.
pred_probs <- predict(model_int_fixed, newdata = pred_grid, type = "link", se.fit = TRUE)
pred_grid$fit <- plogis(pred_probs$fit)
pred_grid$lwr <- plogis(pred_probs$fit - 1.96 * pred_probs$se.fit)
pred_grid$upr <- plogis(pred_probs$fit + 1.96 * pred_probs$se.fit)
ggplot(pred_grid, aes(x = Period, y = fit, color = Obesity_Status)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
geom_line(aes(group = Obesity_Status)) +
labs(
title = "Predicted Probability of Depression with 95% CI",
y = "Predicted Probability"
) +
theme_minimal()
Self-Reported Outcome: Depression history is self-reported and may include reporting bias; it also does not measure severity.
Reference-Level Predictions: Predicted probabilities are computed at a fixed reference age_group and sex, so absolute probability values differ for other demographic profiles.
Omitted Variables: This model controls only for age group and sex; additional covariates (income, race/ethnicity, exercise, alcohol use, comorbidities) may further explain depression risk and change the estimated obesity effect.
Expanded Controls: Add socioeconomic and behavioral covariates to test robustness of the obesity effect.
Subgroup Analysis: Fit the model separately by sex or age_group to check for heterogeneous effects.
Survey Design Considerations: If survey weights are available in the BRFSS file, incorporate them to improve population-level inference.
- Centers for Disease Control and Prevention. (2023). BRFSS
overview. https://www.cdc.gov/brfss/annual_data/2023/pdf/Overview_2023-508.pdf
- Centers for Disease Control and Prevention. (2021). Behavioral
Risk Factor Surveillance System comparability of data BRFSS 2020.
https://www.cdc.gov/brfss/annual_data/2020/pdf/compare-2020-508.pdf